Feeling a little bewildered. Lots to take in. I had forgotten the syntax of R; it has been years since I last used it. The introduction in the book was thorough, but reading is not the same as hands-on experience when it comes to coding. At least the book will work as a good reference source in the future.
https://github.com/lileinon/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Tue Dec 13 01:46:41 2022"
This week we dug into regression analysis and model validation. Both are very useful and interesting and, I have to say, somewhat foreign to me. Too long has passed since I last dabbled in statistics, so much reading was needed to grasp the concepts well enough. Overall I think this week gave me a good basic understanding of how to conduct regression analysis with RStudio.
date()
## [1] "Tue Dec 13 01:46:41 2022"
This week’s dataset comes from Kimmo Vehkalahti. It contains data he collected in a survey conducted among university students in an introductory social sciences course between Dec. 2014 and Jan. 2015. The survey covered 183 students, and it includes 56 questionnaire variables, one compound variable (attitude), and three contextualizing variables (age, gender, and final exam points), bringing the total to 60 columns.
This data was cleaned and made uniform as follows:

- The survey measured three different approaches to learning, represented by answers in the corresponding columns. These answers were merged into three compound columns matching the three approaches: deep, surface (surf), and strategic (stra). The value of each new column is the mean of the answers pertinent to that category (original answers were on a scale of 1 to 5).
- The “attitude” compound column was scaled to the same range (1-5).
- The now superfluous raw survey columns were removed, leaving 7 columns: gender, age, attitude, deep, stra, surf, and points.
- Rows with a value of zero in the “points” column (17 in total) were removed, leaving 166 rows. Since the aim is to measure the impact of study attitudes and approaches to learning, with exam success as the main outcome, rows without exam results were redundant.
- As a minor curation step, the column names were normalized.
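The merging and filtering logic can be sketched on toy data as follows (the column names and question groupings here are invented for illustration; the actual mappings come from the course materials):

```r
# Illustrative sketch of the cleaning steps; column names are hypothetical.
raw <- data.frame(
  D1 = c(4, 3, 5), D2 = c(2, 4, 4),  # two "deep" questions (scale 1-5)
  Attitude = c(37, 21, 50),          # compound attitude, sum of ten items
  points = c(25, 0, 22)
)
raw$deep <- rowMeans(raw[, c("D1", "D2")])  # compound column = mean of answers
raw$attitude <- raw$Attitude / 10           # rescale attitude back to 1-5
cleaned <- subset(raw, points > 0,          # drop rows with zero exam points
                  select = c("attitude", "deep", "points"))
nrow(cleaned)  # 2: one zero-points row was removed
```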
In the following code the .csv file containing the cleaned data is read into the variable “learning2014”, and its dimensions (dim) and structure (str) are checked to make sure everything is as it should be.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
setwd("data")
learning2014 <- read_csv("learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(learning2014)
## [1] 166 7
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The data and its seven main variables can be summarized as follows:
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
The data includes more female (110) and younger (median age 22 years) participants. The strongest correlations with exam points are found with students’ attitude and the “strategic” approach to learning. Male participants show somewhat higher attitude scores. Age correlates best with the “strategic” approach to learning. The three variables “deep”, “strategic”, and “attitude” show positive interrelation. We will choose these last three as explanatory variables, with points as the target variable.
my_model <- lm(points ~ attitude + stra + deep, data = learning2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + deep, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## stra 0.9621 0.5367 1.793 0.07489 .
## deep -0.7492 0.7507 -0.998 0.31974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
This summary shows the significance of each explanatory variable in relation to the target variable. First it lists the residuals, the differences between the model’s predictions and the actual values. Of the coefficient statistics, the interlinked t-value and p-value (Pr(>|t|)) are the most important. Looking at them, it seems that the variable “deep” does not correlate with points (Pr(>|t|) = 0.31974) and is statistically non-significant. Thus we will leave it out of the equation and refit the model with two explanatory variables.
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
A slightly better result. Attitude seems to be the single most significant factor explaining the exam result (p = 6.31e-09). Of the different approaches to studying, “strategic” bears the most significance; its p-value of about 0.089 verges on statistical significance, but does not quite reach it. The adjusted R-squared value of 0.1951 tells us that c. 19.5% of the variation in the students’ exam points can be explained by their attitude and adherence to the strategic approach to learning, with attitude accounting for almost all of this effect. The model’s p-value is 7.734e-09, highly significant, and the F-statistic of 20.99 is considerably larger than the critical F on 2 and 163 degrees of freedom (~3.0). Therefore we can reject the null hypothesis.
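As a quick sanity check (not part of the assignment itself), the adjusted R-squared can be recomputed from the multiple R-squared with the standard formula:

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n = number of observations and p = number of predictors.
r2 <- 0.2048  # multiple R-squared from summary(my_model2)
n  <- 166     # observations in learning2014
p  <- 2       # predictors: attitude and stra
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # ~0.195, matching the reported 0.1951 up to rounding
```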
What remains are the regression diagnostics, shown below in three diagnostic plots.
par(mfrow = c(2,2))
plot(my_model2, c(1, 2, 5))
The Residuals vs Fitted plot confirms that the fitted model is appropriate: the residuals and the fitted values are uncorrelated. The Normal Q-Q plot shows no major departure from the reference line, suggesting the residuals are approximately normally distributed and not skewed. Residuals vs Leverage shows that no data points fall outside Cook’s distance, so there are no overly influential individual data points. Everything checks out.
This week was all about the title. The statistics were interesting and the tools powerful; they should be useful in the future. The only problem I continuously have with R is that every function seems to be a “magic box”. I just put in some numbers, and it does its magic in secret, spewing out magnificent graphs and whatnot, but I always feel like I have no clue about, or maybe even control over, how it does what it does. Like driving with an autopilot: just enter the destination, and the car takes care of everything. Can I say that I drove there? Or even that I can drive, if all I do is input destinations? Each function works differently; this seems not so much about learning general R syntax as about learning which function does what, and what you need to feed it. Perhaps the feeling arises because earlier I have mostly done everything step by step by myself, each graph and such. This feels almost like cheating at times, and on the other hand like giving away control.
date()
## [1] "Tue Dec 13 01:46:49 2022"
Also, loading all the needed libraries here.
library(scales); library(dplyr); library(tidyverse); library(ggplot2); library(boot)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
Okey dokey. This week’s dataset comes from Portugal. It was collected by P. Cortez and A. Silva for a paper published in 2008, and the raw data was made available in the UCI Machine Learning Repository in 2014 (link, with info on the variables etc.). Originally the dataset consisted of two distinct but overlapping questionnaires conducted in a math class and a Portuguese class. The questionnaires gathered plenty of background info on the students, e.g. their domestic situation, siblings, their parents’ occupations, and the students’ alcohol consumption (full list of column names below; only the chosen variables are explained in detail). As target variables the data included three grades: G1 and G2, the grades of the first two trimesters, and G3, the grade for the third trimester and for the whole course at the same time. All in all the data consists of 33 variables, of which 27 are contextualizing variables describing the students’ background and social status, and 6 are performance-related variables connected with how well they did in class (the three grade columns; absences; previous failures of the course; and whether the student had taken extra paid classes on the course subject or not).
The dataset included 395 math students and 649 Portuguese students. Although anonymized, the background questions allowed identifying the students who had answered both questionnaires and selecting only them, which left us with 370 students. Their grades were averaged into a single number. The questionnaire’s two columns on alcohol consumption (graded on a scale of 1-5) were combined into one column, alc_use, and students whose weekly consumption grade surpassed 2 were flagged in a separate high_use column. The aim of this analysis is to study the relationship between high/low alcohol consumption and some of the other variables in the data.
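The derivation of alc_use and high_use can be sketched on toy data (Dalc and Walc are the workday and weekend consumption grades from the column list below):

```r
# Sketch of the alcohol-use columns on toy data; Dalc and Walc are the
# workday and weekend consumption grades (scale 1-5).
toy <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 4, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # mean of the two grades
toy$high_use <- toy$alc_use > 2            # TRUE when the mean exceeds 2
toy$high_use  # FALSE TRUE TRUE
```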
setwd("data")
alc <- read_csv("alc.csv", show_col_types = FALSE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The four variables I have chosen to analyse in relation to alcohol consumption are going out with friends (goout), absences from classes (absences), extracurricular activities (activities), and weekly study time (studytime). Going out a lot with friends would probably correlate most strongly with high alcohol consumption. High alcohol consumption could also cause absences from classes. The last two, on the other hand, are such that they would limit one’s ability to partake in sociable drinking.
This will be a “preliminary” probe, with brief overviews on each chosen variable.
Absences is a numeric count of absences, and high_use is binary. Any correlation should be visible from a simple boxplot.
plot_absences <- ggplot(alc, aes(x = high_use, y = absences))
plot_absences + geom_boxplot() + ggtitle("1. Amount of absences vs. high use of alcohol")
The hypothesis seems to be correct, and the correlation appears significant.
Going out was graded on a scale from 1 to 5, and high_use is a binary. A barplot should reveal if there are any differences between the two groups of high-users and low-users.
plot_goout <- ggplot(alc, aes(x = goout))
plot_goout + geom_bar() + facet_wrap("high_use", labeller = label_both) + ggtitle("2. going out with friends vs. high alcohol usage")
As visible from the barplot, the high users go out a lot more than the low users; their distribution leans heavily to the right. A simple check of mean values and distribution percentages confirms the hypothesis.
alc %>% group_by(high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## # A tibble: 2 × 3
## high_use count mean_goout
## <lgl> <int> <dbl>
## 1 FALSE 259 2.85
## 2 TRUE 111 3.73
alc %>% group_by(high_use, goout) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups: high_use [2]
## high_use goout count percentage
## <lgl> <dbl> <int> <chr>
## 1 FALSE 1 19 7.34%
## 2 FALSE 2 82 31.66%
## 3 FALSE 3 97 37.45%
## 4 FALSE 4 40 15.44%
## 5 FALSE 5 21 8.11%
## 6 TRUE 1 3 2.7%
## 7 TRUE 2 15 13.5%
## 8 TRUE 3 23 20.7%
## 9 TRUE 4 38 34.2%
## 10 TRUE 5 32 28.8%
Extracurricular activities is a binary value, as is high_use. A simple check of the counts should give us an overview.
alc %>% group_by(high_use, activities) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups: high_use [2]
## high_use activities count percentage
## <lgl> <chr> <int> <chr>
## 1 FALSE no 120 46.3%
## 2 FALSE yes 139 53.7%
## 3 TRUE no 59 53.2%
## 4 TRUE yes 52 46.8%
Perhaps surprisingly, there is no significant difference in extracurricular activities between the two groups; both basically break even. Further inquiries regarding this variable would be futile.
Weekly study time is a numeric field (1: less than 2 hours, 2: 2 to 5 hours, 3: 5 to 10 hours, 4: more than 10 hours). A barplot should do the trick for an overview of any correlation.
plot_studytime <- ggplot(alc, aes(x = studytime))
plot_studytime + geom_bar() + facet_wrap("high_use", labeller = label_both) + ggtitle("4. Weekly study time vs. high alcohol usage")
alc %>% group_by(high_use, studytime) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 4
## # Groups: high_use [2]
## high_use studytime count percentage
## <lgl> <dbl> <int> <chr>
## 1 FALSE 1 56 21.6%
## 2 FALSE 2 128 49.4%
## 3 FALSE 3 52 20.1%
## 4 FALSE 4 23 8.9%
## 5 TRUE 1 42 37.8%
## 6 TRUE 2 57 51.4%
## 7 TRUE 3 8 7.2%
## 8 TRUE 4 4 3.6%
As can be seen, almost 90% of high alcohol users studied less than 5 hours per week, whereas the low-use group is distributed more evenly, with much more time spent on studies. The correlation is strong here.
Okey dokey. First the code, then the analysis.
m <- glm(high_use ~ absences + goout + activities + studytime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + goout + activities + studytime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9196 -0.7732 -0.5083 0.8446 2.5676
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.34944 0.53871 -4.361 1.29e-05 ***
## absences 0.06961 0.02213 3.145 0.00166 **
## goout 0.73572 0.11959 6.152 7.64e-10 ***
## activitiesyes -0.31151 0.25661 -1.214 0.22476
## studytime -0.56049 0.16914 -3.314 0.00092 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 375.68 on 365 degrees of freedom
## AIC: 385.68
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.09542238 0.03227113 0.2680871
## absences 1.07208689 1.02777425 1.1223836
## goout 2.08697580 1.66084887 2.6570839
## activitiesyes 0.73234019 0.44132598 1.2094464
## studytime 0.57093162 0.40522569 0.7880221
Absences, going out, and studytime are statistically significant (p-values 0.00166, 7.64e-10, and 0.00092, respectively). Activities, on the other hand, is statistically insignificant, as noted earlier. The correlations assumed in the preliminary hypotheses were also correct with regard to the three statistically significant variables.
The odds ratios (OR) and confidence intervals (CI) show that each additional absence corresponds to about a 7 percent higher likelihood of belonging to the high-use group, with the population value falling between 2.8 and 12.2 percent. Similarly, belonging to one step higher group in studytime means about a 43 percent reduced likelihood of belonging to the high users, with population values between 59.5 and 21.2 percent. Belonging to a higher group in the going-out variable, on the other hand, means a whopping 108.7 percent higher likelihood of being a high alcohol user, with population values between 66 and 165.7 percent. The CI of activities includes 1, which means its effect, positive or negative, cannot be established in the population.
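The percentage interpretations follow directly from the odds ratios; as a quick illustration using the rounded coefficients reported above:

```r
# An odds ratio is exp(coefficient); (OR - 1) * 100 is the percent change
# in the odds per one-unit increase of the predictor.
or_absences  <- exp(0.06961)   # absences coefficient from summary(m)
or_studytime <- exp(-0.56049)  # studytime coefficient
round((or_absences - 1) * 100, 1)   # ~7.2: higher odds per extra absence
round((or_studytime - 1) * 100, 1)  # ~-42.9: lower odds per study-time step
```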
We’ll use the statistically significant variables found, absences, goout and studytime.
m2 <- glm(high_use ~ absences + goout + studytime, data = alc, family = "binomial")
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to 'alc2'
alc2 <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 239 20
## TRUE 63 48
g <- ggplot(alc2, aes(x = probability, y = high_use))
# define the geom as points and draw the plot
g + geom_point(aes(col = prediction))
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64594595 0.05405405 0.70000000
## TRUE 0.17027027 0.12972973 0.30000000
## Sum 0.81621622 0.18378378 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc2$high_use, prob = alc2$probability)
## [1] 0.2243243
The loss function gives 22.4% falsely predicted individuals. The false positives and false negatives in the cross-tabulation give 17.0% + 5.4% = 22.4% of falsely predicted individuals, the same number. This is the total proportion of inaccurately classified individuals (i.e. the training error).
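The same number can be recovered from the proportion table: the training error is just the sum of the off-diagonal cells.

```r
# Recompute the training error from the tabulated proportions above.
conf <- matrix(c(0.64594595, 0.17027027,   # first column: predicted FALSE
                 0.05405405, 0.12972973),  # second column: predicted TRUE
               nrow = 2,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
training_error <- conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]
round(training_error, 7)  # 0.2243243, the loss function's result
```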
Does this model have better test set performance (smaller prediction error using 10-fold cross-validation) than the model introduced in the Exercise Set, which had an error of about 0.26? Let’s find out.
cv <- cv.glm(data = alc2, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2351351
It seems this model has better test set performance (about 0.235) than the Exercise Set model (about 0.26).
Reflections: This week was horribly busy for me, and I also became ill. I barely had the time to do the exercises and the assignments. I did skim through Kimmo’s book chapters, but I must say that the whole concept of how clustering works seems a bit… vague? Yes, it works, but it seems that every method involves either or both of 1) the researcher guessing where to cut the clusters, or 2) a random factor in the clustering such that the results can only be repeated by forcing a fixed random seed. This sounds rather strange and does not give the feeling that the process can be firmly grasped. But I will try my best. As mentioned in the book, clustering is a basic concept of reasoning, and an absolutely necessary tool for data analysis.
date()
## [1] "Tue Dec 13 01:46:51 2022"
Done.
# I will load all the needed libraries and the dataset here for convenience
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyverse)
library(corrplot)
## corrplot 0.92 loaded
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
The Boston dataset is a classic dataset distributed as part of R’s MASS package. Its title, “Housing Values in Suburbs of Boston”, describes its origins well enough. The dataset was gathered for the 1978 article “Hedonic prices and the demand for clean air” (J. Environ. Economics and Management 5, 81–102) by D. Harrison and D.L. Rubinfeld. The 14 variables map factors that affected housing prices in suburbs (called “towns” in the documentation; we’ll use the same term henceforth) around Boston, and include information on, for instance, crime rate, the amount of nitrogen oxides in the air, and access to Boston’s radial highways. The data has already been cleaned; it comprises 506 rows of observations and 14 columns of variables.
The variables are (per the MASS package documentation):

- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk - 0.63)^2, where Bk is the proportion of black residents by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
cor_matrix <- cor(Boston) %>% round(2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Many of the distributions look quite natural, as with the airborne NOx gases. Yet some of the values are uneven. The crime variable, for instance, varies from 0.006 to 88.976, with a median of 0.25 and a mean of 3.61; it seems there is huge variation (and a lot of outliers) in crime between towns, with a few of them accounting for the majority of the cases. Non-retail business acreage, on the other hand, is distributed more evenly, but wealth is clearly unevenly divided (lstat and medv). The black variable also seems to have a lot of outliers (min 0.32, 1st quartile 375.38, mean 356.67). Some variables seem interconnected, judging by their similar spread (for instance dis, rad, zn).
As can be seen, some of the values mentioned above are interconnected. Lower status (lstat) and the median value of homes are negatively correlated to a very high degree. The number of rooms correlates positively with more expensive houses, as can be expected. Interestingly, the dis variable (distance to employment centres) reveals the societal structure and character of the economic centers: older housing and smaller lots on zoned land; more (non-retail) businesses, which bring more customers and employees driving by daily, and hence more nitrogen oxides in the air; and also more crime and more lower-status inhabitants. This dynamic works as a sort of heuristic key for reading the data. As another instance of the same phenomenon, the variable rad (“access to radial highways”) connects all the data points pertinent to the same dynamic: highways connect major centers. The crime rate is connected with this variable to a high degree for this reason.
As an additional check, I will draw boxplots of some of the variables mentioned in the summary analysis. The number of outliers makes it clear that the data is not evenly spread and should be checked for clusters. As a preliminary hypothesis we might say that the clusters would most likely form around the dynamic of more populous, dense commercial towns vs. smaller, less populous towns with much less crime, pollution, commerce, etc.
par(mfrow = c(1, 5))
boxplot(Boston$crim, main = "crim", col = "purple", outcol = "red")
boxplot(Boston$zn, main = "zn", col = "blue", outcol = "red")
boxplot(Boston$lstat, main = "lstat", col = "yellow", outcol = "red")
boxplot(Boston$medv, main = "medv", col = "orange", outcol = "red")
boxplot(Boston$black, main = "black", col = "green", outcol = "red")
Next, we will standardize the dataset and print summaries of the scaled data. Scaling centers each variable and divides it by its standard deviation, so every column ends up with a mean of 0 and a standard deviation of 1.
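What scale() does for one column can be checked by hand:

```r
library(MASS)  # for the Boston dataset
data("Boston")
x <- Boston$crim
manual <- (x - mean(x)) / sd(x)  # centre to mean 0, divide by the sd
stopifnot(all.equal(as.vector(scale(x)), manual))
c(mean = round(mean(manual), 10), sd = round(sd(manual), 10))  # 0 and 1
```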
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Then, as mentioned in the chapter title, we will create a categorical variable to replace the old crim value, and split the data into training and testing sets.
boston_scaled$crim <- as.numeric(boston_scaled$crim)
bins <- quantile(boston_scaled$crim)
labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
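Since `sample()` draws a different split on every knit, the seed could be pinned first so the 80/20 split (and everything downstream) stays reproducible. A minimal sketch, using an arbitrary seed value:

```r
# reproducible 80/20 split: fix the RNG seed before sampling row indices
set.seed(2022)                        # 2022 is an arbitrary choice
n   <- nrow(boston_scaled)
ind <- sample(n, size = round(n * 0.8))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
```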
Next we’ll fit the linear discriminant analysis on the train set. Categorical crime rate is the target variable, with the other variables as predictors.
lda.fit <- lda(crime ~ ., data = train)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The variable rad has the highest correlation with crime rate, and it is mainly through it that we are able to cluster the towns with the highest crime rates. The second and third most important factors are nox and zn, which essentially order the towns by their population density.
Next we'll use the test set to predict the crime rate classes, and cross-tabulate them with the actual results. First we'll save the actual classes as their own variable.
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 14 0 0
## med_low 2 10 5 0
## med_high 1 11 16 2
## high 0 0 0 27
As can be seen, the model predicts most of the cases correctly, with the best accuracy clearly at the high end of the spectrum; the low and middle classes get confused with each other more often.
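The eyeballing above can be backed with a single number: taking the counts from the cross-tabulation, the overall accuracy is the share of the diagonal.

```r
# counts copied from the cross-tabulation above (rows = correct, cols = predicted)
conf <- matrix(c(14, 14,  0,  0,
                  2, 10,  5,  0,
                  1, 11, 16,  2,
                  0,  0,  0, 27),
               nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)   # overall accuracy, roughly 0.66
```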
Following the assignment, we'll first reload the dataset and standardize it so that each variable has zero mean and unit variance.
data(Boston)
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Next, we'll compute the Euclidean distances between the observations, and check out the summary of the values.
dist_eu <- dist(boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
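As a quick sanity check that `dist()` really uses the Euclidean metric by default, the distance between the first two observations can be recomputed by hand:

```r
# distance between observations 1 and 2, computed manually
d12 <- sqrt(sum((boston_scaled[1, ] - boston_scaled[2, ])^2))
# ...should equal the corresponding entry of the dist object
all.equal(d12, as.matrix(dist_eu)[1, 2])
```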
Next we’ll run a k-means algorithm on the dataset.
set.seed(13)
km <- kmeans(boston_scaled, centers = 4)
pairs(boston_scaled, col = km$cluster)
Four clusters seems like a lot. Let's try to gauge the optimal number of clusters via WCSS analysis (total within-cluster sum of squares). A sharp bend in the graph should indicate a good place to make the cut.
set.seed(123)
k_max <- 10
# use the scaled data here too, to match the clustering above
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# qplot() is deprecated since ggplot2 3.4.0, so draw the line with ggplot()
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) + geom_line()
The optimal number of clusters seems to be 2. Let's rerun the k-means analysis with 2 clusters.
set.seed(13)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
Not very informative due to the small plot size. Let's take a closer look at the variables we noted as correlated earlier: crim, zn, nox, dis, rad, lstat and medv.
pairs(boston_scaled[, c(1, 2, 5, 8, 9, 13, 14)], col = km$cluster)
Some of these might still be dropped, but on the whole it seems that these two clusters denote meaningful and significant differences between two groups of observations in the data. The commercial and economic centers have smaller lots, poorer air quality, better access to radial highways, more poor people, and less expensive homes. Further analysis would be needed to elaborate.
Just not enough time. Being ill with a small baby has its drawbacks.
Yet, this seems intriguing. Who doesn’t want more dimensions? Gotta try, I’ll give it “15 minutes”, and just list the instructions here as I go through them. Not for the points, but for the coolness factor of 3D thingies.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=classes)
It is beautiful. Oh wait, I can move it! C O O L
# plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$cluster)
Okey dokey. This produces an error, probably because km$cluster is not part of the data matrix we created. I should join the values from the cluster column to the train set, and then create the data matrix and the 3D plot again… but this time I think I will just go to bed and sleep.
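For the record, a sketch of what that join could look like: since the rows of `train` are `boston_scaled[ind, ]`, the cluster labels can simply be subset with the same indices (assuming `ind` and `km` are still in the environment and the row order has not changed):

```r
# subset the k-means cluster labels to the training rows and use them as colors
train_clusters <- km$cluster[ind]
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = factor(train_clusters))
```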
Peace out.
date()
## [1] "Tue Dec 13 01:47:01 2022"
Some reflections: it has been a really tiring week, and I'm writing this in the middle of the night. So, the analyses are a little livelier than usual, sorry about that. Not much sleep due to having a small baby, but I'm still glad I have managed to do this course. Reading through Kimmo's work this week, I felt that I actually understood the concepts. The same went for the assignments: reading through them, they seemed rather easy and clear, much more focused on rehearsing and repeating what was already learned than the previous weeks were. However, the schedule and sleep deprivation were such that I was not able to do the exercises and course diary until tonight.
library(dplyr)
library(tidyverse)
library(ggplot2)
library(GGally)
library(corrplot)
library(FactoMineR)
getwd()
## [1] "C:/Users/lauri/Desktop/IODS7/IODS-project"
# paths are relative to the project root; no need for setwd()
human <- read.csv("data/human.csv", row.names = 1)
This week’s data comes from the UN. The data originally comes from two different reports, those of the Human development index (HD) and Gender inequality index (GII). These indexes are measured through various variables (the HD 8 variables and GII 12 variables), and each country and region is ranked according to their index. More information on the data and the variables here and here.
The data was first combined by country name. At this point the data included 195 observations of 19 variables. The data was then modified by creating two combined columns: the ratio of females to males in secondary education, and the same ratio in the labor force (these were reported by gender in the original data). So: labF / labM -> labFM, and mutatis mutandis the same for the secondary education data. The rows with missing data, and the rows that aggregated information at the continent level, were removed. The country names were moved from the "country" column to the row names. And finally, only the columns pertinent to this inquiry were kept, amounting to 8 variables.
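The two ratio columns described above could have been created along these lines. This is only a sketch: the name `human_raw` and the gender-specific column names `Edu2.F`, `Edu2.M`, `Labo.F`, `Labo.M` are assumptions, and the raw data may name them differently.

```r
# hypothetical names for the raw data and its gender-specific columns
human_raw <- human_raw %>%
  mutate(Edu2.FM = Edu2.F / Edu2.M,   # F/M ratio in secondary education
         Labo.FM = Labo.F / Labo.M)   # F/M ratio in labor force participation
```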
Let’s take a look at the variables.
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
cor(human) %>% corrplot
ggpairs(human)
A lot to unpack. First of all, some variables have wildly different scales. Mat.Mor varies from 1 to 1100 with a median of 49; GNI varies from 581 to 123124 with a median of 12040; Ado.Birth varies from 0.6 to 204.8 with a median of 33.6. Some also seem skewed, with means differing significantly from the medians. Huge differences between countries, but also huge differences between the scales of the variables (some have max values close to 1, some around 50-200, and GNI over 120000). According to Kimmo's book, p. 246, "when variables are on very different scales or have very different variances, a PCA of the data should be performed on the correlation matrix" - something to keep in mind.
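Incidentally, a PCA on the correlation matrix can be requested directly, without standardizing the data by hand: `prcomp(x, scale. = TRUE)` is equivalent to running `prcomp()` on `scale(x)`.

```r
# PCA on the correlation matrix: scale. = TRUE standardizes each column first
pca_cor <- prcomp(human, scale. = TRUE)
summary(pca_cor)   # same variance shares as prcomp(scale(human))
```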
Looking at the correlation plots it’s easy to see that the variables do have massive correlations. Mat.Mor and Ado.Birth correlate strongly and negatively with GNI, Edu2.FM, Life.Exp and Edu.Exp. This division between the two first mentioned with almost identical correlations, and the four latter with almost identical correlation plots - but inverse from the first two - characterizes the whole data. Rich against poor. The rich live longer and receive better education, while the poor countries are here characterized by maternal mortality and adolescent births, which reflect negatively on the other variables.
Women's participation in the labor force (or is "work force" the correct term?) and women in parliament do not seem to tip the scales much.
Rrrright. As mentioned, the scales are wildly different between variables. So, we already know where this is going, but let’s do it anyway.
pca_human <- prcomp(human)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
The almighty GNI explains everything. Perhaps stupidly drawn plots like these explain modern economic policies? Be that as it may, this approach certainly does not work well for visualizing our data: one variable, GNI, is on a massively larger scale than the others, and therefore dominates the output, "explaining" 99% of the variance only because its numbers are the biggest and baddest. Let's try this again with more reason.
So, the same again, but with standardized data and some labels.
human_std <- scale(human)
pca_human <- prcomp(human_std)
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
A somewhat more sensible outcome. Now we can actually see the significant factors at work. As noted earlier, the dynamic between the two "poor-country" variables and the four "rich-country" variables is what separates the countries most, and it accounts for 53.6% of the variance. Interestingly, here we can also see the Parli.F and Labo.FM variables, which account for 16.2% of the variance along the second component.
Right. I kinda feel like I already gave most of my interpretation earlier, but let’s recap. The primary dynamic in the data is the division between poor and rich countries. Rich countries with high GNI score better also on life expectancy, expected secondary education and ratio of women in secondary education (to the left in the chart).
Poorer countries fare worse in these factors, but the main visible factors were maternal mortality and adolescent births: the countries with high numbers in these variables were very distinctly set apart from the countries in the first group, and are found to the right of the origin.
Interestingly, the ratio of women in parliamentary roles or participating in the work force was not significantly correlated with this rich-poor dynamic, but formed a secondary axis within this division. It seems women in parliament and/or the workforce are not as important as education in relation to the other variables.
Sorry about that, don’t know what got into me. Back to R.
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
view(tea)
# I'll remove the age-variable, as it's the only numerical one, and the following pivot_longer doesn't work with it
tea <- select(tea, -age)
# In two parts
pivot_longer(tea[1:18], cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") +
geom_bar() +
theme(axis.text.x = element_text(size = 6))
pivot_longer(tea[19:35], cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") +
geom_bar() +
theme(axis.text.x = element_text(size = 6))
Plenty of variables. Tea is very interesting, I must say. As per the assignment, let's select some variables and do the MCA. We'll go with Tea, age_Q, escape.exoticism, exciting, relaxing and spirituality, because why the hell not. Should be weird. Maybe we'll find out what tea people in different age groups drink to relax and have spiritual, exotic or escapist experiences. For science!
keep_columns <- c("Tea", "age_Q", "escape.exoticism", "exciting", "relaxing", "spirituality")
# all_of() avoids the tidyselect deprecation warning for external vectors
t_time <- select(tea, all_of(keep_columns))
# check out the new data
str(t_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
summary(t_time)
## Tea age_Q escape.exoticism exciting
## black : 74 +60 :38 escape-exoticism :142 exciting :116
## Earl Grey:193 15-24:92 Not.escape-exoticism:158 No.exciting:184
## green : 33 25-34:69
## 35-44:40
## 45-59:61
## relaxing spirituality
## No.relaxing:113 Not.spirituality:206
## relaxing :187 spirituality : 94
##
##
##
pivot_longer(t_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
Next, the MCA.
mca <- MCA(t_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = t_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.255 0.243 0.194 0.174 0.172 0.160 0.148
## % of var. 15.289 14.556 11.622 10.434 10.306 9.619 8.894
## Cumulative % of var. 15.289 29.845 41.466 51.900 62.206 71.824 80.718
## Dim.8 Dim.9 Dim.10
## Variance 0.130 0.109 0.082
## % of var. 7.829 6.518 4.935
## Cumulative % of var. 88.547 95.065 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.851 0.948 0.330 | -0.466 0.299 0.099 | -0.040
## 2 | 0.674 0.595 0.232 | 0.476 0.311 0.115 | -0.909
## 3 | 0.134 0.023 0.015 | -0.198 0.054 0.033 | -0.237
## 4 | -0.921 1.109 0.692 | -0.340 0.158 0.094 | -0.229
## 5 | -0.413 0.223 0.114 | -0.075 0.008 0.004 | -0.778
## 6 | -0.374 0.183 0.155 | -0.463 0.294 0.238 | 0.312
## 7 | 0.096 0.012 0.006 | -0.592 0.482 0.218 | 0.183
## 8 | 0.554 0.402 0.152 | -0.897 1.106 0.398 | 0.110
## 9 | 0.096 0.012 0.006 | -0.592 0.482 0.218 | 0.183
## 10 | 0.554 0.402 0.152 | -0.897 1.106 0.398 | 0.110
## ctr cos2
## 1 0.003 0.001 |
## 2 1.423 0.421 |
## 3 0.097 0.048 |
## 4 0.090 0.043 |
## 5 1.042 0.403 |
## 6 0.168 0.108 |
## 7 0.058 0.021 |
## 8 0.021 0.006 |
## 9 0.058 0.021 |
## 10 0.021 0.006 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.888 12.719 0.258 8.785 | -0.731 9.068
## Earl Grey | -0.501 10.567 0.453 -11.638 | 0.170 1.270
## green | 0.940 6.353 0.109 5.713 | 0.649 3.181
## +60 | 1.053 9.188 0.161 6.935 | -0.593 3.055
## 15-24 | -0.957 18.365 0.405 -11.004 | -0.354 2.636
## 25-34 | -0.088 0.116 0.002 -0.830 | 0.845 11.288
## 35-44 | 0.466 1.896 0.033 3.163 | -0.736 4.961
## 45-59 | 0.581 4.484 0.086 5.072 | 0.429 2.572
## escape-exoticism | -0.248 1.897 0.055 -4.057 | 0.083 0.225
## Not.escape-exoticism | 0.222 1.705 0.055 4.057 | -0.075 0.202
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.175 -7.238 | -0.266 1.499 0.023 -2.630 |
## Earl Grey 0.052 3.937 | -0.072 0.291 0.009 -1.683 |
## green 0.052 3.944 | 1.020 9.847 0.129 6.200 |
## +60 0.051 -3.902 | 0.482 2.537 0.034 3.177 |
## 15-24 0.055 -4.068 | 0.144 0.548 0.009 1.657 |
## 25-34 0.213 7.988 | 0.811 13.029 0.197 7.668 |
## 35-44 0.083 -4.991 | -0.196 0.441 0.006 -1.330 |
## 45-59 0.047 3.749 | -1.307 29.887 0.436 -11.418 |
## escape-exoticism 0.006 1.364 | -0.702 20.070 0.443 -11.507 |
## Not.escape-exoticism 0.006 -1.364 | 0.631 18.038 0.443 11.507 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.453 0.197 0.135 |
## age_Q | 0.521 0.357 0.540 |
## escape.exoticism | 0.055 0.006 0.443 |
## exciting | 0.008 0.506 0.005 |
## relaxing | 0.190 0.380 0.037 |
## spirituality | 0.303 0.009 0.002 |
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
As we can see, these chosen variables do not account for much variance in the data: only 15.29 + 14.56 = 29.85%. So, all the following analyses should be taken with a pinch of salt. Or a handful. But, it seems that there is a NO-camp and a YES-camp on the question of whether tea as a substance awakens feelings, such as spirituality, escapism, excitement and/or relaxation. The younger one is, the more prone one is to succumb to such frivolous emotions. The NO-camp, proudly heralded by the 60+ class, considers tea not exciting, not relaxing, not spiritual and most definitely not in any way connected with escapist exoticism. The best tea to drink in the NO-camp is black (as one's soul), but some also go with green (as envy). Earl Grey, on the other hand, is for the babies in the YES-camp.
On a more serious note, not much clustering. The oldest age groups contributed most to the analyses, followed by black and green teas (judging by the values on the v-test). But, at the moment I think I’m just too tired to analyse anything further. I’ll just have a nice cup of Earl Grey and commit these to the Hub.
date()
## [1] "Tue Dec 13 01:47:12 2022"
Reflections: doing this on the last day. It has been exhausting; I've been sick with a tenacious flu, and the baby doesn't like to sleep. Reflecting more on the whole course than just this week: I had used R briefly before, but for very different purposes, in a very different way, and some 6 years ago or so. So everything was not new, yet new. I found out that I need to grasp statistical concepts more firmly - my understanding of them is on too shaky a ground for me to feel comfortable doing analyses.
Overall a good course in the sense that it gave me glimpses of something new. Yet I cannot say that I master these methods after this brief intro. So, perhaps next I need to take a deeper look at statistical analyses at ground level, and at R in connection with them. My approach seems to be inverted: starting from the more complicated, advancing to the simpler stuff. Still, I think the analyses I aim to do with R will benefit from this course; I just need some more time with the fundamentals. Perhaps a more hands-on "DIY" approach would have been better: this course was an intro, and as such entailed lots of copying of ready-made code. That of course is what we all do, but for something to stick to one's spine, for a workflow to become the norm, one has to really hack it deep into the ape-brain. And for me the only way to achieve this is repetition. So, perhaps the next step for me is doing some reps to get some gains in them old muscles; here in this course I had the PT giving me pointers on technique.
Onwards to the assignment. This seems very applicative. But a lot of work, and I have only a few hours. I’ll do what I can; first the graphs and such, and if I have time I will add the verbal analysis.
Just couldn't keep my eyes open any longer, so I did not finish the assignment properly. Most of the charts are done, but the analysis is lacking. Also, the neat and tidy R Markdown formulations are not done: some of the explanations are inside code blocks.
Apologies in advance, I’m afraid this will not be that easy to follow or give points.
Here I’ll load all the necessary libraries
library(tidyverse)
library(ggplot2)
library(tidyr)
library(dplyr)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
Import data from the data wrangling set, check everything is ok.
RATSL <- read_csv("data/rats.csv")
## Rows: 176 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): WD
## dbl (4): ID, Group, Weight, Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <dbl> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
# as evident from the glimpse, gotta redo the categorical variables into factors
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
Okey dokey. The assignment was to reproduce the charts and tables of MABS-book chapter 8 with RATS-data. I will go through the charts and tables by their numbers in the book, and redo them if possible. Analysis will be added as the final stage, if I have time.
NB! There are some charts / tables that I will not reproduce, since they are not relevant for the analysis. Such tables are for instance an example of the first 40 rows of data, and Table 8.2 “Possible summary measures”, which is not calculated from the data.
IF NEED BE: my data RATSL is in long form, which might not be the most useful shape for everything that follows. So, just in case, I will also reload the raw data in wide form.
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",
header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
str(RATS)
## 'data.frame': 16 obs. of 13 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD1 : int 240 225 245 260 255 260 275 245 410 405 ...
## $ WD8 : int 250 230 250 255 260 265 275 255 415 420 ...
## $ WD15 : int 255 230 250 255 255 270 260 260 425 430 ...
## $ WD22 : int 260 232 255 265 270 275 270 268 428 440 ...
## $ WD29 : int 262 240 262 265 270 275 273 270 438 448 ...
## $ WD36 : int 258 240 265 268 273 277 274 265 443 460 ...
## $ WD43 : int 266 243 267 270 274 278 276 265 442 458 ...
## $ WD44 : int 266 244 267 272 273 278 271 267 446 464 ...
## $ WD50 : int 265 238 264 274 276 284 282 273 456 475 ...
## $ WD57 : int 272 247 268 273 278 279 281 274 468 484 ...
## $ WD64 : int 278 245 269 275 280 281 284 278 478 496 ...
-> This just gives an overview of the data; we're not going to repeat it.
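For reference, the long form could be rebuilt from this wide table in one pipe. A sketch; the `WD<day>` column naming pattern is taken from the `str()` output above.

```r
# wide -> long: gather the WD* columns and parse the day number into Time
RATSL_rebuilt <- RATS %>%
  pivot_longer(cols = starts_with("WD"),
               names_to = "WD", values_to = "Weight") %>%
  mutate(Time = as.integer(sub("WD", "", WD))) %>%
  arrange(Time, ID)
```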
# has to have the linetype = ID for this graph to work.
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
Individual response profiles for the RATS data after standardization.
# same as above, but with standardized values
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdweight = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(name = "standardized weight")
Mean response profiles for the three groups in the RATS data.
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
# glimpse(RATSS)
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
geom_point(size=3) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), linetype = 1, width = 0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
Boxplots for the RATS data.
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1…
## $ Weight <dbl> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ Time <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, …
## $ stdweight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, …
ggplot(RATSL, aes(x = WD, fill = Group, y = Weight))+
geom_boxplot()
Possible Summary Measures (Table 8.2) –> This is something else, not needed.
Boxplots of mean summary measures for the three groups in the RATS data.
# Create summary data by group and subject, with mean weight as the summary variable (note: the baseline day 1 is included here)
RATSL_1 <- RATSL %>%
group_by(Group, ID) %>%
summarise(mean = mean(Weight)) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATSL_1)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
# Draw a boxplot of the mean versus treatment
ggplot(RATSL_1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
scale_y_continuous(name = "mean(weight)")
Boxplots of mean summary measures for the two treatment groups in the BPRS data, without the outlier shown in Figure 8.5.
# Create a new dataset by filtering out the outlier, and adjust the ggplot code to draw the plot again with the new data
RATSL_2 <- RATSL %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
filter(mean < 590) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSL_2, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
scale_y_continuous(name = "mean(Weight)")
I removed just one outlier; I am not sure whether the intention was to remove all of them.
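If the intent was to remove every boxplot outlier rather than just the extreme one, a helper applying the conventional 1.5 * IQR rule within each group would be one way to do it. This is only a sketch; the function name `drop_outliers` is mine, not from the course material:

```r
library(dplyr)

# Drop boxplot-style outliers within each group, using the
# conventional 1.5 * IQR rule (an assumption about what
# "removing all of them" would mean here).
drop_outliers <- function(df, value, group) {
  df %>%
    group_by({{ group }}) %>%
    filter({{ value }} >= quantile({{ value }}, 0.25) - 1.5 * IQR({{ value }}),
           {{ value }} <= quantile({{ value }}, 0.75) + 1.5 * IQR({{ value }})) %>%
    ungroup()
}

# e.g. RATSL_2b <- drop_outliers(RATSL_1, mean, Group)
```

With only 16 rats in the summary data this would drop at most a handful of rows, so the boxplots would change only slightly.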
All three of the following table titles are grouped together for the ANOVA analysis:
Results From an Independent Samples t-test on the Mean Summary Measure for the BPRS Data, Without the Outlier Shown in Figure 8.5
Results from an Analysis of Covariance of the BPRS Data with Baseline BPRS and Treatment Group as Covariates
Results from an Independent Samples t-test for the Mean Summary Measure Used on the Data Partially Shown in Table 8.5. (a) Leaving Out Subjects with Any Missing Value, (b) Mean of Available Values for Each Subject
# Not sure what needs to be done here, so I'll recreate what was done in the Exercise
# t.test(mean ~ Group, data = RATSL_1, var.equal = TRUE) is NOT suitable, because there are three groups, not two: so just the ANOVA, then
# Add the baseline from the original data as a new variable to the summary data
RATSL_3 <- RATSL_1 %>%
mutate(baseline = RATS$WD1) # the original wide RATS data was needed after all (assumes matching row order)
RATSL_3 <- RATSL_3 %>%
filter(mean < 590)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSL_3)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 207029 207029 3272.6190 5.751e-15 ***
## Group 2 1226 613 9.6878 0.003748 **
## Residuals 11 696 63
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
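Since the plain two-sample t-test does not fit three groups, one possible follow-up to the ANOVA would be pairwise t-tests between the groups with a multiplicity correction. A sketch on the summary data above; the choice of Bonferroni adjustment is mine:

```r
# Pairwise comparisons of mean weight between the three groups,
# with Bonferroni-adjusted p-values (pairwise.t.test is in base R's stats).
pairwise.t.test(RATSL_3$mean, RATSL_3$Group, p.adjust.method = "bonferroni")
```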
Load the data from the data wrangling exercise:
BPRSL <- read_csv("data/bprs.csv")
## Rows: 360 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): weeks
## dbl (3): treatment, subject, bprs
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(BPRSL)
## Rows: 360
## Columns: 4
## $ treatment <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <dbl> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
# As above, convert the categorical variables back into factors
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
The same as above: load the raw wide-form data too, just in case.
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt",
sep =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
str(BPRS)
## 'data.frame': 40 obs. of 11 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
As above, I will produce the charts and tables (listed here with references to the book), but I did not have time to write up the analysis.
NOTE: all the chart "names" will refer to rats, even though I am using the BPRS data here, because the chart names come from the book.
##TABLE 9.1 Body Weights of Rats Recorded Over a 9-Week Period
This is just viewing the data:
view(BPRS)
Long Form of the Data for the First Two Rats in Group 1 in Table 9.1
#already done in wrangling, just glimpsing here
glimpse(BPRSL)
## Rows: 360
## Columns: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <dbl> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
Plot of weight against time for rat data, ignoring the repeated-measures structure of the data but identifying the group to which each observation belongs.
Right... with the BPRS data this is probably close enough:
ggplot(BPRSL, aes(x=weeks, y = bprs, group = treatment, colour = treatment)) +
geom_point()
Results from Fitting a Linear Regression Model to Rat Data with Weight as Response Variable, and Group and Time as Explanatory Variables, and Ignoring the Repeated-Measures Structure of the Data
BPRSL_week <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))
BPRSL_lm <- lm(bprs ~ week + treatment, data = BPRSL_week)
summary(BPRSL_lm)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL_week)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Plot of individual rat growth profiles.
ggplot(BPRSL_week, aes(x = week, y = bprs, linetype = subject, colour = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(name = "bprs")
Scatterplot matrix of repeated measures in rat growth data.
BPRSL_week_only <- BPRSL_week[, -3] # drop the redundant character "weeks" column, keeping the numeric week
pairs(BPRSL_week_only)
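The pairs plot above is drawn on the long-form data, so it mixes all weeks into single columns. The book's scatterplot matrix is drawn on the wide data, with one column per week. A sketch, assuming the wide `BPRS` loaded above with the week columns in positions 3 to 11:

```r
# Book-style scatterplot matrix of the repeated measures:
# one panel per pair of weeks, using the wide-form data.
pairs(BPRS[, 3:11])
```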
Results from Fitting Random Intercept Model, with Time and Group as Explanatory Variables, to Rat Growth Data
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL_week_only, REML = FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL_week_only
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
Results from Fitting the Random Intercept and Slope Model, with Time and Group as Explanatory Variables, to Rat Growth Data
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL_week_only, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL_week_only
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL_week_only
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results from Fitting the Random Intercept and Slope Model that Allows for a Group × Time Interaction to Rat Growth Data
BPRS_ref2 <-lmer(bprs ~ week * treatment + (week | subject), data = BPRSL_week_only, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL_week_only
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
Fitted growth rate profiles from the interaction model and observed growth rate profiles.
Fitted <- fitted(BPRS_ref2)
# Add the fitted values as a new column (the book's code refers to RATSL; here it is BPRSL)
BPRSL_f <- BPRSL_week_only %>% mutate(Fitted = Fitted)
# Draw the plot of BPRSL with the fitted values of bprs
ggplot(BPRSL_f, aes(x = week, y = Fitted, linetype = subject)) +
geom_line() +
facet_grid(. ~ treatment, labeller = label_both) +
#scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "bprs") +
theme(legend.position = "none")
# Must be divided into two separate plots (observed vs. fitted); otherwise the result looks jagged
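For comparison, the observed profiles can be drawn as a second plot with the same layout as the fitted one, which keeps each plot readable. A sketch reusing the names above:

```r
# Observed bprs profiles, laid out like the fitted plot above
# so the two can be compared side by side.
ggplot(BPRSL_f, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(name = "bprs (observed)") +
  theme(legend.position = "none")
```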
First Five Patients in Each Treatment Group of the “Beat the Blues” (BtB) Clinical Trial of CBT for Depression
-> Not interesting; not done.
Box plots of BDI scores by occasion of recording and treatment group.
# This is just the same exercise as Exercise 6, copied below; we have already done every meaningful chart
BPRSL8S <- BPRSL_week %>%
filter(week > 0) %>%
group_by(treatment, subject) %>%
summarise( mean=mean(bprs) ) %>%
ungroup()
## `summarise()` has grouped output by 'treatment'. You can override using the
## `.groups` argument.
# Draw a boxplot of the mean versus treatment
library(ggplot2)
ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(bprs), weeks 1-8")
Scatterplot matrix of BDI scores.
Hmm... already done above.
Results from Fitting a Multiple Linear Regression Model to BtB Data Assuming the Repeated Measurements of BDI are Independent
The same: this repeats the material from above (the book switched datasets in the middle).
Thank you for reading thus far, and apologies for the really awful format of this last assignment. Just didn’t have time, due to life. Score accordingly. Peace out!